Genome-wide association study identifies loci and candidate genes for grain micronutrients and quality traits in wheat (Triticum aestivum L.)

Malnutrition due to micronutrients and protein deficiency is recognized among the major global health issues. Genetic biofortification of wheat is a cost-effective and sustainable strategy to mitigate the global micronutrient and protein malnutrition. Genomic regions governing grain zinc concentration (GZnC), grain iron concentration (GFeC), grain protein content (GPC), test weight (TW), and thousand kernel weight (TKW) were investigated in a set of 184 diverse bread wheat genotypes through genome-wide association study (GWAS). The GWAS panel was genotyped using Breeders' 35 K Axiom Array and phenotyped in three different environments during 2019–2020. A total of 55 marker-trait associations (MTAs) were identified representing all three sub-genomes of wheat. The highest number of MTAs were identified for GPC (23), followed by TKW (15), TW (11), GFeC (4), and GZnC (2). Further, a stable SNP was identified for TKW, and also pleiotropic regions were identified for GPC and TKW. In silico analysis revealed important putative candidate genes underlying the identified genomic regions such as F-box-like domain superfamily, Zinc finger CCCH-type proteins, Serine-threonine/tyrosine-protein kinase, Histone deacetylase domain superfamily, and SANT/Myb domain superfamily proteins, etc. The identified novel MTAs will be validated to estimate their effects in different genetic backgrounds for subsequent use in marker-assisted selection.


Materials and methods
Planting material and conduct of experiment. A set of 184 diverse bread wheat genotypes consisting of old and new Indian elite varieties, exotic lines, landraces, synthetic hexaploid, and derived lines were used for GWAS (Supplementary Table S1). The GWAS panel was evaluated for GZnC, GFeC, GPC, TKW, and TW during 2019-20 crop season at three diverse locations viz., IARI-New Delhi (E1) (Indian Agricultural Research Institute, Research Farm, New Delhi located 29.7008° N, 76.9839° E, 228.6 m AMSL), IARI-Indore (E2) (Indian Agricultural Research Institute, Regional Station, Indore located 22.7196° N, 75.8577° E, 228.6 m AMSL) and GBPUAT-Pantnagar (E3) (Govind Ballabh Pant University of Agriculture and Technology, Research Farm, Uttarakhand, 29.0229° N, 79.4879° E, 243.8 m AMSL). Each genotype was grown in a 5 rows plot of 2.5 m each, with a row-to-row distance of 0.25 m following augmented design with repeated checks namely, HD 3086, C 306, HI 1544, and GW 322. The pests and diseases were controlled chemically, whereas weeds were controlled both manually and chemically. Plant materials were harvested after the grains reached physiological maturity and were completely dry in the field.
Phenotyping. Twenty spikes of each entry were manually harvested, threshed, and carefully cleaned by discarding broken grains and foreign material without touching to metal parts of the farm equipment and used for micronutrient analysis. GZnC and GFeC were measured with a "bench-top" non-destructive, energy-dispersive X-ray fluorescence spectrometry (ED-XRF) instrument (model X-Supreme 8000; Oxford Instruments plc, Abingdon, United Kingdom) standardized for high-throughput screening of mineral concentration of wholegrain wheat 35 . The GZnC and GFeC were expressed in milligrams per kilogram (mg/kg). The GPC was measured by Infra-red transmittance-based instrument Infra-tec 1125 and expressed in percentage (%). The TKW was measured by weighing a set of randomly selected 1000 grains representing the whole grain sample in the Numigral grain counter. To record the TW, a thoroughly cleaned grain sample was poured into the metallic funnel of the hectoliter weight instrument developed by the ICAR-Indian Institute of Wheat and Barley Research, Karnal. After the grain was levelled well, the outlet was opened to allow the free flow of the grain in the metallic tubular container below till it is filled. Then the shutter was slid to remove the excess grain and to level it. The grain contained in the measuring can was then weighed using an electronic balance. The TKW was expressed in grams (g) and TW as kilogram per hectoliter (kg/hl).
Phenotypic data analyses. The phenotypic data was analysed with ACBD-R (Augmented Complete Block Design with R) version 4.0 software 36 . The mean, coefficient of variation (CV), least significant difference (LSD), genotypic variance, and heritability were estimated. In ACBD-R v4.0, the best linear unbiased predictors (BLUPs) of each genotype were calculated for each environment and across environments along with four checks varieties (HD 3086, C 306, HI 1544, and GW 322). The calculated BLUPs were then used in the GWAS analysis. The frequency distribution graphs and correlation coefficients of the recorded traits were obtained through Past 3.01 software 37 .
In silico analysis. A 100 bp sequence was extracted from Ensemble Plants database (http:// plants. ensem bl. org/ index. html) of the bread wheat genome (IWGSC (RefSeq v1.0)) and added on both sides of the SNP for in silico analysis. Insilico search for the putative candidate genes was then done using Basic Local Alignment Search Tool (BLAST) in the Ensemble plant database (https:// plants. ensem bl. org/ index. html). The genes found in the overlapping region and within 1 Mb upstream and downstream of the matched regions were selected as candidate genes and their molecular functions were determined. In addition, their expression patterns were investigated using the Wheat Expression database (http:// www. wheat-expre ssion. com/) and potential links to phenotypes were determined using Knetminer tool integrated with Wheat Expression database. The role of the identified putative candidate genes in the regulation of GZnC and GFeC was also determined with the previous reports.

Results
Phenotypic evaluations. A set of 184 diverse genotypes in the GWAS panel were evaluated for nutritional and other grain quality traits in three diverse locations viz., IARI-New Delhi (E1), IARI-Indore (E2), GBPUAT-Pantnagar (E3), and the combined data across three environments (E4). The summary statistics including mean, range, coefficient of variance, heritability, and variance estimates are presented in Table 1. The genotypic variance is significant for all the studied traits. An extensive range of variation was observed for all the traits in all the studied environments (Table 1; Fig. 1). The variation for GZnC, GFeC, GPC, TKW, and TW ranged from 17.56 mg/kg to 56.93 mg/kg, 25.47 mg/kg to 52.09 mg/kg, 8.6% to 15.81%, 24.33 g to 57.18 g and 64.48 kg/hl to 83.77 kg/hl, respectively. The heritability estimates for GZnC, GFeC, and GPC were variable and ranged between 0.5-0.88, 0.4-0.8, and 0.56-0.82, respectively, heritability for TKW and TW were greater and ranged between 0.73 and 0.92. Based on the combined BLUP values over the environments, the ten best-performing lines for the traits were selected and presented in Table 2. The landrace Navrattan and Syntheic Hexaploid Wheat (SHW) 2.38 were among the best performing genotypes for GZnC, GFeC, and GPC. The Indian variety Lokbold was found among the best performing lines for GZnC, GFeC, and TKW.
The Pearson correlation coefficients (r) estimated between the traits in each environment are presented in Table 3. The association was positive and highly significant (p < 0.01 to 0.001) among GZnC, GFeC (Fig. 2), and GPC in each environment and across environments. The GFeC was found to have a consistently significant positive correlation with TW and TKW (p < 0.05-0.01) in E1, E2, and E4, however, GZnC did not show any correlation with either TW or TKW in any of the environments. Further, the association of GPC with TW and TKW was negative and significant in E2, E3, and E4 (p < 0.05-0.01). The TKW showed a strong positive correlation with TW in all the environments (p < 0.01-0.001).
Marker distribution, LD decay and population structure. A set of 9,503 high quality SNP markers were distributed across the genome with the highest number of markers on the B sub-genome (3646), followed by A (2979) and D (2878) sub-genomes, respectively. Chromosome-wise distribution suggests that the highest number of markers were mapped on chromosome 1B (675) followed by chromosome 2B (653) and 1D (610). Chromosomes 4D (170) and 4B (266) had the least number of markers ( Table 4).
The LD was estimated by calculating the squared correlation coefficient (r 2 ) for all the 9503 markers. Genomewide LD decayed with genetic distance, the LD decayed to its half at 4.71 Mb for whole genome, and 3.63 Mb for A, 5.63 Mb for B and 4.90 Mb for D sub-genomes (Fig. 3).  (Fig. 4a). The three groups consisted of 45 (G1), 20 (G2) and 119 (G3) genotypes respectively. The PC1, PC2 and PC3 accounted for 10.63%, 8.72% and 5.45% of the total variation respectively. The first three principal components were used as covariates in GWAS analysis to reduce the false positives. The Clustering methods (N-J tree) also revealed the three subpopulations, thus confirming the results of PCA (Fig. 4b). The G1 has most of the exotic lines, G2 constituted of some of the new Indian varieties and G3 was dominated by breeding lines. The Indian varieties were distributed in all three groups.
Marker-trait associations. A total of 55 MTAs were detected; 4 for GFeC, 2 for GZnC, 23 for GPC, 15 for TKW, and 11 for TW. The details of these MTAs are summarized in Table 5 and depicted as Manhattan plots in Fig. 5. The Q-Q plots illustrating observed associations between SNPs and grain micronutrient concentrations compared to expected associations after accounting for population structure are presented in Fig. 5.

MTAs for GFeC.
A total of four significant MTAs were identified for GFeC in E1, E3, and E4 environments on chromosomes 2B, 3A, 3B, and 6A (Table 5; Fig. 5). The phenotypic variation (PV) explained by these SNPs ranged between 8.82 and 12.62%. A major SNP on chromosome 3A (AX-95002032) located at 637.9 Mb explained 12.62% of the PV, while another SNP on chromosome 6A (AX-94715803) located at 585.4 Mb explained 11.14% of PV, both were detected in E3. The other SNPs, AX-94761251 on 2B and AX-94850629 on 3B explained the PV of 9.26 and 8.82%, respectively. The SNP, AX-95002032 had A and C alleles with a phenotypic average of 30.68 mg/kg and 36.14 mg/kg respectively. The SNP, AX-94715803 had A and G alleles with a phenotypic average of 30.6 and 34.76 mg/kg respectively (Fig. 6).
The two genotypes i.e. the Navrattan (landrace) and 2.38 (SHW) were among the best performing genotypes for GZnC (36.05 and 37.72 mg/kg), GFeC (37.87 and 38.27 mg/kg) and GPC (12.32% and 12.22%) based on the combined BLUPs across environments and hence can be efficiently utilized in breeding programmes. The genotype Lokbold was identified to be another good performer for GFeC (37.76 mg/kg), GZnC (35.26 mg/ kg), and TKW (49.8 g) and therefore can also be given due consideration in breeding programmes ( Table 2). The performance of many old Indian varieties was found to be better for GZnC, GFeC, and GPC than recently released cultivars and vice versa was the case for TKW and TW, confirming the dilution effect, i.e. increased efforts of plant breeders in enhancing grain yield led to an unintentional increase of more starchy endosperm, thus reductions in other important quality components in modern wheat varieties 54 .
The significant positive correlations among GZnC, GFeC, and GPC indicate the possibility of simultaneous improvement of these traits. This finding is in line with earlier reports 44,[49][50][51][52] . Additionally, many studies suggested a common genetic basis for these traits through GWAS and conventional QTL studies 50,51,55,56 . Also, GFeC and GZnC showed either positive or no correlation with TKW and TW, indicating that the grain Zn and Fe can be increased without yield penalty. However, the study also shows the negative association of TW and TKW with GPC indicating yield penalty with the increase in GPC beyond a certain level 34 , thus it is suggested to improve protein quality profile with the optimum level of protein quantity required for the superior-quality end product.
The population structure inferred by PCA revealed three sub-populations in the GWAS panel (Fig. 4). Similarly, NJ-based clustering divided the whole set of 184 genotypes into three distinct clusters. The genotypes were clustered in the previous studies mainly based on pedigree, geographical, and evolutionary origin 29,31-33 . In the current study, G1 was dominated by exotic lines, G2 constituted new Indian varieties and G3 by breeding lines. Most significantly, the Indian varieties were mixed up with all the three groups, thus pointing towards their broad genetic base. The results also suggest that many new Indian varieties might have been bred by introgressing genes from exotic lines.
The LD decay over genetic or physical distance in a population determines the density of marker coverage needed to perform GWAS. A faster LD decay indicates that a higher marker density is required to capture the markers close enough to the causal loci 57 . In the present study, the LD decayed to its half from the maximum LD at 4.71 Mb for whole genome, 3.63 Mb for A, 5.63 Mb for B and 4.90 Mb for D sub-genomes (Fig. 3). A similar LD pattern of 5.98 Mb was reported in a set of Chinese wheat landraces 58 . In contrast, the whole genome LD decay was faster and it was at the distance of 2 Mb in a set of CIMMYT spring bread wheat lines 59 . In addition,   Table 5. Marker trait associations (MTAs) detected for GFeC, GZnC, GPC, TKW, and TW in IARI-New Delhi (E1), IARI-Indore (E2), GBPUAT-Pantnagar (E3), and across environments (E4).  www.nature.com/scientificreports/ the whole genome LD decay distance of 3 Mb was reported 60 . In contrast to faster LD decay, the slower LD decay distance of 22 Mb and 23 Mb respectively were found in a set of hexaploid wheat collections from Kazakhstan and in Mexican bread wheat landraces 61,62 . The variation in the LD pattern among different GWAS populations may be due to factors like selection, mutation, admixtures, non-random mating, etc. A total of 55 MTAs were identified, 4 for GFeC, 2 for GZnC, 23 for GPC, 15 for TKW, and 11 for TW. The four MTAs were identified for GFeC on chromosomes 2B, 3A, 3B, and 6A and explained the PV ranging between 8.82 and 12.62% (Table 5; Fig. 5). Previously, MTAs for GFeC were reported on chromosome 3A 28 , on chromosome 3B 1,27 , further QTL were reported on 2B and 6A by 51,63 . The major SNP on chromosome 3A (AX-95002032) located at 637.96 Mb explained 12.62% of the PV, and the putative candidate gene linked with this marker is TraesCS3A02G389000 (F-box-like domain superfamily). Interestingly, the E3 ubiquitin ligase complex containing the FBXL5 (F-box and leucine-rich repeats protein 5) protein targets iron regulatory protein (IRP2), the FBXL5 accumulating under iron-and oxygen-replete conditions and degraded upon iron depletion 64,65 . These observations also hint at the possible role of FBXL5 in iron sensing in plant systems. The SNP AX-94715803 on chromosome 6A located at 585.43 Mb explained 11.14% of PV. The putative candidate gene linked with this marker is TraesCS6A02G353900 (Zinc finger CCCH-type, G-patch domain). It is noteworthy that the zinc finger transcription factors, which control the functions of various genes, have a DNA binding domain that requires zinc or iron ions for its structural and functional stability and for activation 66 . The possible role of zinc finger protein in wheat grain zinc accumulation was also reported earlier 31,50,51 . The other two MTAs, AX-94761251, AX-94850629 found on chromosome 2B at 458.62 Mb and on chromosome 3B at 473.69 Mb explained a respective PV of 9.26 and 8.82%, respectively. In silico analysis revealed the putative candidate genes TraesCS2B02G321500 (Domain of www.nature.com/scientificreports/ unknown function DUF3475, DUF668) for AX-94761251 and TraesCS3B02G295000 (Serine-threonine/tyrosineprotein kinase) for AX-94850629. The possible role of protein kinases in grain iron and zinc accumulation in wheat was also indicated by other authors 1,27,28,50,51,67 . The protein kinases phosphorylate Fe and Zn proteins and are found to show greater interactions with Fe and Zn transporter proteins 68 , hence these are expected to have a potential role in grain micronutrients accumulation. The two SNPs associated with GZnC found on chromosomes 1A and 7B with PV ranged from 6.35 to 7.60% (Table 5; Fig. 5). Previous studies also reported the QTLs on these chromosomes 28,31,33,51 . An SNP on chromosome 7B (AX-94422893) located at 488.41 Mb explained 7.60% of PV, this region encodes TraesCS7B02G266000 (Histone deacetylase domain superfamily, Ureohydrolase domain superfamily). Histone deacetylases play an important role in gene regulation. The zinc ion acts as a cofactor and regulates the catalytic function of the classical HDAC family of enzymes (Class I, II, IV) 69 . Another SNP on chromosome 1A (AX-94651424) located at 544.72 Mb explained 6.37% of PV and codes for TraesCS1A02G365900 (SANT/Myb domain, Homeobox-like domain superfamily). Interestingly, Arabidopsis Myb transcription factors positively regulate the biosynthesis of glucosinolates 70 which in turn are involved in trade-off with Zn. This is evident from the studies on Thlaspicaerulescens, where Zn hyper accumulation decreased sinalbin (p-hydroxybenzylglucosinolate) concentration in shoots 71 .
Additionally, 23 MTAs identified for GPC, 15 for TKW, and 11 for TW. They explained the PV of up to 18.19%, 12.07%, and 13.79% respectively. A major SNP AX-94939463 identified for TKW on chromosome 7A at 731.88 Mb, stably found in three environments namely, E1, E3, and E4, and explained the PV of 12.07, 10.13, and 8.09% in the respective environments. This SNP was found in the region codes for TraesCS7A02G560100 (Polysaccharide biosynthesis domain). Interestingly, the polysaccharide synthesizing glycosyltransferases are the enzymes generally organized on Golgi bodies that catalyze the synthesis of more complex and highly branched polysaccharides 72 . Previously, two candidate genes namely, TaSus1 and TaGASR-A1 were reported on chromosome 7A 73 .
The positive and highly significant correlation between GFeC, GZnC and GPC (P < 0.001) in all the environments suggested the possibility of simultaneous improvement of these traits. The best performing lines like Navrattan, SHW 2.38, and Lokbold can be utilized as sources in the breeding pipeline and for developing mapping populations to discover QTLs for grain Zn and Fe concentration and protein content. Further, the promising SNPs on chromosome 1A, 7B for GZnC and 2B, 3A, 3B, and 6A for GFeC could be converted into breeder's friendly Kompetitive Allele Specific PCR (KASP) markers to be used in marker-assisted selection (MAS) or targeted introgression to develop biofortified cultivars. The putative candidate genes identified need to be validated further to shed light on their functional role in grain Fe and Zn concentration. License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.